ABSTRACT

As part of a project on assessing deep understanding 
of subject matter, a study was conducted to assess students' 
knowledge of history by focusing on essay writing. Reading a provided 
text was incorporated into the assessment procedure. Ratings from 
five high school history teachers were compared with those of four 
English teachers for 85 essays from 11th-grade advanced placement 
students in a suburban high school. No significant differences in 
student performance were found for the two text passages or question 
type (brief or extended) in ratings of either group. English and 
history teachers were looking at student papers in fundamentally 
similar ways, but scoring criteria were not adequate in that they 
focused on what teachers said they valued, rather than what they 
actually would write themselves. Using an expert-novice model, new 
questions were developed, and the new test was administered to 250 
11th graders and scored under new scoring rubrics. Higher interrater 
reliabilities encouraged researchers to train four history teachers 
in the new scoring approach. Ongoing research is examining the 
utility of the new scoring focus for high school students in two 
school districts. The study illustrates the complexity of developing 
new and useful assessments. Eight tables summarize test development 
principles, and a 30-item list of references is included. (SLD) 
Introduction¹



For at least a quarter of a century, educators and critics have raised 
conceptual and technical questions about standardized achievement tests (Strenio, 
1981). And for the most part, the public and its policy makers have ignored these 
ululations and continued to believe in the accuracy and usefulness of these 
measures, dismissing technical concerns as abstrusely academic and teacher 
complaints, at minimum, as self-serving. However, recent reform efforts, stemming 
from A Nation At Risk (National Commission for Educational Excellence, 1983) and 
other dark reports of American educational quality, have directed renewed attention 
and investment in achievement outcomes. With the statement of national 
educational goals by the President in 1990 and the governors of the fifty states in 
1989, and the President's promise to measure achievement in grades 4, 8, and 12, 
standardized achievement tests are about to become national educational policy. 
The consequences of error in test design and interpretation are inestimably higher 
than in the past, for such measures will exert dramatic control on the public school 
curriculum, on what tests are published, and on what is taught. Information from 
achievement measures must answer three questions: What is the quality of our 
students' achievement? How can achievement be improved? Why can't present 
tests do the job? For the purposes of accountability and instructional improvement, 
the vast majority of existing standardized achievement tests are wholly inadequate. 
They create the wrong expectations and incite inaccurate inferences in terms of 
policy action. They are inappropriate in at least three central ways: their 
underlying theory, their content, and their procedures. These assertions deserve at 
least brief elaboration. 

The measurement assumptions of standardized tests rely on models based in 
theories of stable individual differences. These models posit a general construct 
such as mathematics ability or reading comprehension and require at least two 
conditions for its measurement: (a) substantial variation among people on the target 
test in order to differentiate scores, that is, scores on the 78th or 64th percentile are 
intended to reflect different levels of performance; and (b) stability of measurement 
for individual performance for accurate prediction. When mapped against the 
requirements for assessing an individual's educational improvement or the impact of 
systemic educational reform intended to assure all students' success, these 
instruments do not measure up. Reports from most standardized tests obscure the 
meaning of the test scores from the teacher, the student, and the public. We may 
know the relative position of individuals and school districts compared to other 
individuals or school districts, but we do not know what level of performance any 
given score describes. Further, even under the best conditions, educational reform 
has weak effects. So to detect change, progress toward national goals, for example, 
achievement measures must be created that are sensitive to minor, but real 
differences in performance. Tests should tell us who has changed in ability to 
perform particular tasks at described levels of expertise. Standardized achievement 
tests do not tell us what we should want to know. 

The problem of interpreting these tests is amplified by the way their 
content is selected. A major problem is content sampling within a particular subject 
matter, such as history or mathematics. Most subject matter measures are 

1 We wish to thank Tom Kerins, John Craig, and Carmen Chapman, of the Illinois State 
Board of Education; Bob Hill, of the Springfield Public Schools; Lynn Winters of the Palos 
Verdes School District; and the many principals and English and history teachers who 
participated in or helped with this project. In addition, we would like to thank colleagues at 
UCLA who helped with various aspects of the study: Pam Aschbacher, Jamal Abedi, Joan 
Herman, Edys Quellmalz, Merl Wittrock, Simon Chang, Yujing Ni, Regie Stites, Kyung-Sung 
Kim, and Rebecca Frazier. 



commercially available and are intended to be sold to school districts and states. To 
[illegible] companies must attempt to include a sufficient number of 
topics with broad appeal in any subject matter. A common result, as predicted two 
decades ago (Popham & Husek, 1969), is a content-curriculum mismatch, where the 
overlap on test content and curriculum varies by district, school, or classroom. This 
phenomenon has been documented in many specific subject fields, for example, in 
mathematics by Floden and his colleagues (198?). In practical effect, a mismatch 
means that certain topics that are untreated in the curriculum in given classrooms 
and schools will be included on the test. On the other hand, even topics 
emphasized in teaching may only be superficially measured because of time 
constraints. Both types of errors result in misrepresenting students' actual 
achievement. One solution to this problem has been to encourage teachers to adapt 
[illegible] "alignment," a course of 
action that cedes enormous and inappropriate power to the developers of such 
tests. A more global content issue is created by the pressure to test in a 
relatively limited number of subject matters. Such choices have been made as a 
matter of course to save money and time as well as to constrain the number of 
measures on which public accountability will be based. As a rule, districts and states 
commonly select an essential core of subject matter, often [illegible] 
and mathematics. Teachers and school policy makers adapt 
instructional time to focus on the goals to be measured. One consequence of this 
adaptation may be a reduction in time for untested subject fields: foreign languages, 
the humanities, the arts, and the sciences. This reduction occurs not only [illegible] 
accountable aspects of the curriculum, but also because of the 
widespread, pernicious belief that students must learn the "basics" before they can 
profit from exposure to other subject matters and more complex intellectual 
processes. Particularly for poor performing students, opportunities in a wide range 
of subject matter are foregone (Oakes, 1986). The result is obvious: an [illegible] 

The constraints on administration and scoring of standardized tests also 
[illegible]. Tests have been developed with strict 
time boundaries, partly in an effort to reduce testing time, partly from historical 
[illegible] purposes. [illegible] 
obtain differentiated and reliable responses, it is better to limit test 
time and to expose students to many short items. More test items also mean more 
topics can be covered. Multiple-choice items are the most frequently preferred 
achievement testing format because they are time sensitive, [illegible] a 
relatively large set of items, have acceptable psychometric properties, and allow 
economic scoring approaches. How do these choices affect students? Multiple 
[illegible] The format [illegible] how 
[illegible] presented, learned, and retained. These tests assess learning in an 
artificial, decontextualized manner that is remote from how students learn or will 
apply knowledge in the future. These tests are likely to reduce student motivation 
[illegible] Such formats also convey a false sense 
[illegible] 
than for learning, the simple problem of showing "improvement" in achievement 
[illegible] design programs [illegible] 
have few acceptable options. They may 
persist in doing the best they can, but may continue to see public confidence erode 
when test scores do not respond to their efforts. They may react in [illegible] 
1987; Linn, [illegible] results. Schools may 



New Choice Points 



The expectation that accountability measures will directly and productively 
[illegible] measures that can provide information at the 
[illegible] not divert large proportions of 
instructional time from learning tasks. 

To meet the legitimate concerns for accountability and resulting instructional 
[illegible] 
approaches to testing are: (a) their focus on important and teachable learning 
[illegible], (b) the confidence we can place in their measurements, and (c) the 
appropriateness of cues they provide for instruction. 

Cognitively Sensitive Assessment 

If we start with the notion that tests should measure significant learning in a 
way that supports desired performance, we are immediately led to a reversal of 
[illegible] practice. Instead of having tests constrain instruction, assessment 
[illegible] should map directly on significant features of learning. 
[illegible] skilled experts can tell whether learners [illegible] 
[illegible] on a wide range of intellectual tasks. Our problem is to transmute the 
[illegible] into procedures suitable for use in large-scale 
assessment. We must shift our view from the measurement of broad constructs 
[illegible] important and described [illegible] 
knowledge acquisition, deep understanding, and problem solving. These processes 
must be assessed as they are embedded in various tasks and content domains; 



however, our assessment strategies may attempt to capture attributes of 
performance that transfer across subject matter domains. In our CRESST project on 
assessing deep understanding of subject matter, we conducted research designed to 
transfer knowledge that developed in learning research and apply it to the problem 
of assessing the understanding of history. What will follow is a chronological 
description of the developmental history of our project, interpolated by discussions 
of the generalizable problems confronting developers of new approaches to 
assessment. 

Project Goals and Plans: Developing New Criteria for Scoring Writing in 
History 

Stimulated by articulate statements about the importance of knowledge of 
history by Hirsch, Kett, and Trefil (1987) and the dismal performance of American 
students on tests of historical knowledge (Ravitch & Finn, 1987), we decided to 
focus our attention on the measurement of history knowledge, specifically aimed at 
assessing a deeper understanding of history. We conceived of the problem for 
students as a comprehension task dependent upon their ability to generate or 
construct meaning (Wittrock, 1974) from provided stimuli and by activating students' 
prior knowledge. This approach contrasts with the conception of history knowledge 
as a single construct dependent upon the accumulation of separate pieces of 
knowledge. Consequently, we broadened our approach from the usual multiple- 
choice format, building on our research group's considerable experience in 
developing measures of writing skill (Baker, 1987; Quellmalz, Capell, & Chou, 1982). 

Our initial idea was to attempt to expand the content quality scoring rubrics 
used to assess writing and to apply them to subject matter topics in the field of 
history. Extant content quality scoring rubrics have treated content in one of two 
ways: as elaborated detail that contributes to a good essay in holistic scoring; or as 
important, unique material dependent upon the particular topic presented the 
learner. This second conception guides approaches used in scoring Advanced 
Placement Tests in History (Vaughan, 1983) and in primary trait scoring in the 
National Assessment of Educational Progress (1990). In this topic-dependent 
approach, individuals with expertise in the assigned topical area meet and develop 
post hoc standards for the particular set of papers written. The benefit of this 
procedure is the development of scoring scales that are particularly appropriate for 
the topic assigned. However, that strength is at once a severe limitation: First, the 
level of specificity required to adapt scoring criteria to a particular topic inhibits 
their more general use for other, similar topics. Thus, every topic possesses a unique 
set of criteria. Combining such particularized assessments across a range of topics or 
over a number of years involves a complex scaling process, based on equating results 
for different topics. Among a number of flaws, a major consequence of scaling is the 
ambiguity of score meaning. A second limitation relates to the inferences for 
instruction that can be derived from such measures. If every topic requires a unique 
set of criteria, what guidance can be provided to the teacher to inform teaching 
processes to improve student performance? Only if the tasks and scoring criteria are 
made public— released by the test producers— can teachers guide students to meet 
such standards, and then only if the same tasks are used. The trick is to find the 
appropriate level of generality to describe criteria so they are simultaneously 
appropriate for the particular assessment topics and conceived in terms that can 
guide future instructional practice and assessment. 

Goals 

The goals of our assessment research in the measurement of deep 
understanding of history were: (a) to develop valid formats for eliciting students' 
thoughtful explanations about history concepts; (b) to create and validate content 
quality scoring criteria for students' responses; and (c) to explore these 
developments in the context of large-scale assessment settings. A longer term 
interest is to communicate the test design characteristics so that they will be helpful 
to the design of effective teaching strategies. 

Strategies 

Target. In light of our technical expertise in writing assessment, our project 
focused on essay writing in history. We believed that the strong tradition for this 
type of task in history instruction would increase the chances, if successful, of 
widespread acceptance of new assessment strategies. We also determined from 
reviews of plans for state assessment activities that writing in social studies was 
planned for many of the more forward-looking state assessment enterprises (for 
instance, California, Connecticut, Illinois, and Michigan). Finally, we believed that 
the present approaches used in the scoring of content-focused writing were 
inappropriate both conceptually and practically for the dual purposes of measuring 
deep understanding in large-scale settings and providing inferences useful for 
instruction. 

Plan. In order to verify the need for essay scoring systems to assess content 
quality, we first had to determine if content specific scoring criteria for history 
already existed implicitly in the scoring behavior of history teachers. If so, we would 
identify these criteria, train others to use them, and validate their utility. If not, we 
would explore the literature to infer criteria that might be used. Even though our 
goal was to develop scoring approaches with reasonable generalizability across tasks 
to facilitate instructional improvement, we decided to limit our studies severely. We 
planned to focus on a grade level (11th grade) and on a single topic area in history, 
for we wished to be sure our findings were well grounded in a defined context. If 
we were encouraged by our results, we planned to test the generalizability of the 
approach: for other subject matter areas, for the age ranges of students for whom 
the approach was useful, and for sets of administration conditions. In sum, we 
anticipated the development of broadly useful assessment approaches as we 
conducted initial research in a restricted environment. 

Our first problem was to identify specific content topics and strategies for 
data collection that would allow us to explore the issues of content quality scoring 
criteria. One requirement was to assure that students had some previous exposure 
to the concepts we planned to assess so that they could respond to our tasks. We 
hoped to assign passages in commonly used textbooks for this purpose. To that end, 
we reviewed textbooks, literature on the teaching of history, and available 
curriculum guides to determine the topics and most desirable sections of secondary 
school textbooks appropriate for our experiments in measurement. Our review of 
textbooks led to unoriginal but nonetheless depressing results. For every topic we 
pursued, we discovered that secondary school texts presented relatively superficial 
treatments, without sufficient concepts and depth of supporting knowledge to allow 
the development of deep understanding. These views have been supported in the 
literature by Beck, McKeown, and Gromoll (1989), Sewall (1987), and FitzGerald 
(1979). We also consulted at length with the staff of the UCLA Center for the Study 
of Teaching and Learning in History, a collaborative enterprise of the National 
Endowment for the Humanities that brings together experts in history and 
curriculum. 

Goal Redefinition 

Because we were unable to identify suitable text segments for use in the 
assessment, we decided to incorporate the reading of a provided text as part of the 
assessment procedure itself. This decision transformed in a serious way our 
assessment focus. Rather than an exclusive focus on measuring the accumulation of 
information developed over a long period of instruction, we now attended to two 



major content issues: students' ability to read and integrate new information with 
previously learned knowledge, and students' ability to explain new ideas using their 
prior knowledge. This transformation placed our work squarely in line with 
cognitive views of language comprehension (Anderson, Spiro, & Anderson, 1978; 
Rumelhart, 1980; Brown, Bransford, Ferrara, & Campione, 1983; Kieras, 1985). 
However, we were still driven principally by our subject matter concerns, a fact that 
guided the formulation of criteria for the topic and text selection for assessment 
tasks displayed in Table 1. 



Table 1 

Criteria for the Selection of History Texts to Assess 



1. Must be a regular and significant piece of the secondary school history 
curriculum in the United States. 

2. Must depend upon primary source material rather than summaries in 
textbooks. 

3. Must allow for multiple interpretations and inferences. 

4. Must transcend immediate events and allow students to find relationships 
to other historical and contemporary events. 

5. Must be brief enough to read within a class period. 



Based on the application of these criteria, we decided that original speeches 
or essays composed by historical figures would meet criteria two, three, and five. For 
our initial set of studies, we selected the texts of the Lincoln and Douglas debates on 
popular sovereignty and slavery, choices that met the remaining criteria as well. 

Identification of Content Quality Scoring Criteria: The First Pass 

Our goal was to assemble valid criteria to assess understanding of history 
content. But essay writing consists of both content expertise and communication 
skills. We were well aware of and troubled by the high intercorrelations in the 
literature between subscores on essays of expression skills and content knowledge 
(Baker & Quellmalz, 1980; Langer, 1984). Although it was obvious that highly verbal 
students would usually learn more about verbally based content areas, we were 
especially interested in discriminating performance between the ignorant facile 
writer with little subject matter understanding and the knowledgeable student with 
less developed writing skills. This desire corresponded to the common practice of 
high school teachers, who give both a "content" grade and a "form" grade (e.g., A-/B) 
on student essays. We wanted to focus on the elements that compose the content 
score. 

A related concern was the impact of content knowledge (or lack thereof) on 
the raters' application of scores. We believed that knowledgeable people with 
experience in the subject matter would be needed to make the levels of distinction 
in which we were interested. Our first empirical study attempted to determine if 
the quality of content in essays, its accurateness, aptness, and structure, would be 
judged similarly by history teachers using implicit but common criteria for quality. 
We would contrast their ratings with those given by English teachers, specifically 
teachers trained to score essays in terms of the quality of general writing skill or 
expression such as organization, style, and purpose. The essays we collected for this 
study were provided by eleventh-grade Advanced Placement (AP) history students 
in a suburban high school. We chose AP students because they would be likely to 
write "scorable" papers, that is, produce a sufficient quantity of writing to be graded. 
The AP students also had been exposed to an instructional sequence on the 
pre-Civil War period approximately five months earlier, so they would possess some 
background knowledge of the topic. 

The experimental procedures spanned two consecutive days. On the first 
day, students were given a general multiple-choice examination in pre-Civil War 
[illegible] validated by six expert history teachers. Next, students 
completed a background questionnaire describing their grades in English and social 
studies, self-estimates in ability, interest in writing and in social studies, and 
descriptions of teachers' instructional and assessment practices in history. On the 
second day, students were randomly assigned to read either the Lincoln or the 
Douglas debate text. After the students completed their reading, they were given 
an essay question in either a brief or an extended form that asked them to explain 
the author's main issues and why they were important. Students were allowed [illegible] 
minutes to read the text of the speech and to write their essay. The papers were 
independently scored by two groups of raters: the English teachers and the history 
teachers. 
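
The assignment step described above can be sketched in a few lines. This is a minimal illustration, not the procedure the study actually used: the report states only that students were randomly assigned to a text, so the random assignment of question form, the balanced cell sizes, the function name `assign_conditions`, and the seed are all assumptions made for the example.

```python
import random
from collections import Counter


def assign_conditions(student_ids, seed=0):
    """Randomly assign each student to one text and one question form.

    Assumption: both factors are crossed and balanced (a 2 x 2 layout).
    The report itself only states that the text was randomly assigned.
    """
    texts = ["Lincoln", "Douglas"]
    forms = ["brief", "extended"]
    cells = [(text, form) for text in texts for form in forms]

    rng = random.Random(seed)
    ids = list(student_ids)
    rng.shuffle(ids)

    # Deal the shuffled students round-robin into the four cells so that
    # cell sizes differ by at most one student.
    return {sid: cells[i % len(cells)] for i, sid in enumerate(ids)}


# Hypothetical roster of 85 students, matching the number of essays scored.
assignment = assign_conditions(range(1, 86))
cell_sizes = Counter(assignment.values())
```

Shuffling first and then dealing students into cells keeps the groups nearly equal in size, which simple independent coin flips would not guarantee with only 85 students.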

Procedures for English teacher raters. One rater group was composed of 
four English instructors, all highly experienced in rating student essays according to 
holistic and analytic techniques. All had been trained to use the writing scoring 
scales developed at UCLA (Smith, 1978; Quellmalz, Smith, Winters, & Baker, 1980) 
and subsequently adapted for use in numerous state assessments, research studies, 
and the international comparisons of written composition performance (Baker, 
1987). These scales included four major categories: general competence, essay 
organization, paragraph coherence, and support (meaning detail), as well as scales 
for grammar and mechanics. We also were interested in the thought processes that 
raters used and their initial levels of stringency. Thus, we asked raters prior to their 
training to read three sample papers privately, to rate them on a five-point scale, 
and to comment on their decisions and impressions; comments were tape-recorded. 
Raters also were asked to identify criteria for a good paper. The training was 
conducted using procedures described by Quellmalz (1986) with model papers and 
illustrations of score points. The raters were told explicitly to focus on issues of 
presentation and rhetorical effectiveness rather than content-specific issues, such as 
content accuracy and depth of explanation. Nonetheless, during the training the 
raters insisted on modifying the scoring system: They decided to include as part of the 
general competence subscore some index of the student's attention to the specific 
writing task. All raters independently scored each of the 85 papers. 

Procedures for the history teachers. Independently, and with no 
knowledge of the English teacher group or their resulting scores, a group of five 
history specialists was assembled to rate the same set of essays. Two were high 
school Advanced Placement teachers (from a school different than the data 
collection site) and three raters were advanced graduate students in history. Like 
the English teachers, all history raters were asked to assess three essays and to think 
aloud into the tape recorder as they completed this rating task. Their actual rating 
instructions differed dramatically from those given to the English teachers: No 
preexisting scoring scale was used, and no extensive training was conducted to 
determine if the history group shared implicit criteria. Each rater was told to give 
each paper two scores. The first score was to reflect how well the essay 
demonstrated serious understanding of the debate text read by the student. The 
second score was to provide an estimate of the essay's general quality, taking into 
account issues other than the essay's content. These scores conformed to the 
content-form scoring mentioned above. We also asked the history group to select 
the ten best and ten worst essays, so that we could infer from their choices the 
operational criteria they used to make their judgments. Each history teacher 
independently rated each of the 85 papers, giving each a content quality and an 
overall quality score. Following the rating session, all teachers discussed in a group 
the attributes that distinguished the highest from the lowest rated papers. 

Findings and Interpretations 

Detailed data analyses were conducted; only the highlights will be reported 
here. No significant differences in student performance were found for text 
passage (Lincoln or Douglas) or question type (brief or extended), in the ratings of 
either group. Our findings verified the inappropriateness of the existing UCLA 
scoring scale for the content-focused task we used. Alpha coefficients among raters 
ranged from a low of .52 for mechanics to a high of .75 for general competence (the 
one score where raters took into account the task content). This finding reinforced 
the need for the development of a content quality scoring rubric. For the history 
raters, the alpha coefficient on general quality was .69 and on content quality was 
.75. The generalizability rating for English raters (4 raters by 4 subscales) was .65 
and for history raters (5 raters by 2 subscores) was .73. An interesting finding was 
that the percentage of exact agreement for scores given in the history group to 
content quality was only 33%, suggesting that no clear set of implicit criteria was 
operating among the history specialists. In addition, a t test was computed between 
average scores given by the history teachers and the history graduate students; 
significantly higher scores were assigned by the secondary school history teachers. 
The correlation between general competence scores assigned by English teachers 
and history content quality scores on the same papers was .80, similar to the 
relationship between general competence assigned by the English group and the 
general quality score assigned by the history group (.82). Such data suggested that 
English and history teachers were looking at papers in fundamentally similar ways. 
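
The two agreement indices reported above, an alpha coefficient among raters and a percentage of exact agreement, can be illustrated with a short sketch. The report does not give the formulas used, so two assumptions are made here: alpha is computed as Cronbach's alpha with raters treated as the "items," and exact agreement is averaged over all pairs of raters. The ratings are invented for illustration and are not the study's data.

```python
from itertools import combinations

import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a papers x raters array (raters act as items)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    rater_variances = scores.var(axis=0, ddof=1)     # variance of each rater's scores
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of summed scores per paper
    return (k / (k - 1)) * (1.0 - rater_variances.sum() / total_variance)


def pairwise_exact_agreement(scores):
    """Mean proportion of papers given identical scores, over all rater pairs."""
    scores = np.asarray(scores)
    pairs = combinations(range(scores.shape[1]), 2)
    return float(np.mean([(scores[:, a] == scores[:, b]).mean() for a, b in pairs]))


# Hypothetical ratings: 6 papers scored by 3 raters on a five-point scale.
ratings = np.array([
    [4, 4, 5],
    [2, 3, 2],
    [5, 5, 5],
    [1, 2, 1],
    [3, 3, 4],
    [2, 2, 3],
])

alpha = cronbach_alpha(ratings)
agreement = pairwise_exact_agreement(ratings)
```

In the invented data the raters order the papers almost identically, so alpha is high even though they assign the same score less than half the time. The same contrast appears in the findings above, where alpha for content quality was .75 while exact agreement was only 33%.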

Unfortunately, the expert knowledge possessed by history teachers did not 
seem to differentiate their judgments of student essays. But some aspect of special 
knowledge was operating, however faint. A low but significant correlation was 
obtained between the content quality scores of the history teachers and the total 
multiple-choice knowledge score (r = .32, p < .05). Leads for the development of 
content quality scoring criteria had to come from other sources. We then reviewed 
the history raters' think-aloud ratings and their post-rating discussions of the ten best 
and worst papers. The historians agreed that the best papers had the qualities listed 
in Table 2. 

Table 2 

History Specialists' Generation of Criteria 



Established historical context 

Presented a sound thesis early in the paper 

Detail contributed to thesis, was correct, and was not simply opinion 

Avoided absolute judgments 

Presented multiple points of view 

Avoided interpreting the past in terms of present conditions 

Scoring Criteria: Pass Two 

In an effort to explore the utility of these criteria, a comprehensive and 
detailed scoring rubric was constructed based on these categories. The 12-category 
scoring scheme comprised the elements in Table 3 below; these elements were to be 
used as scoring dimensions for the papers. 



Table 3 

Scoring Criteria Inferred from Ratings of History Papers 

Identification of the Historical Problem/Central Concept 

Depth of Elaboration 

Breadth of Elaboration 

Flexibility 

Fluency/Detail 

Evidence of an Analytical Problem 

Goal Orientation 

Logical Structure 

Evidence of Historical Analysis 

Autocriticism 

Presentation 

Style 

Detailed descriptions for each of five scale points for every category were 
prepared. Based on a brief tryout with raters and reviews by experts, however, we 
deemed that this comprehensive set of categories was too ambitious. A review of 
literature on characteristics of expert knowledge (see Voss, 1978) suggested how we 
could pare the set down to five categories thought to represent critical attributes of 
historical thinking: Historical Context, Depth of Elaboration, Breadth of Elaboration, 
Evidence, and Historical Analysis. In addition, we added two categories related to 
expression, Rhetorical Structure and Mechanics, as well as an overall quality rating, 
General Impression. New scale point descriptions were generated for each of the 
eight categories and model papers were assembled to illustrate particular attributes 
for training purposes. Four history raters (three AP history teachers and one history 
graduate student) were trained in the use of the new system. They spent two days 
rating the same set of 85 eleventh-grade papers used in the first study. Raters were 
observed as they scored papers and were queried about their satisfaction with the 
rating scales and training procedures. Raters had been given the scoring rubric in 
two forms: an extended, multipaged form with detailed explanations about each 
score point for use in training; and an outline of the dimensions. It was expected 
that after the initial training period the raters would use the outline form. However, 
they chose to continue to refer to the extended form, more rigidly adhering to the 
rubric than we expected. Raters reported that they could differentiate among 
categories and that they could also distinguish among criteria for score points (1-5) 
within each category. Raters were highly satisfied with the scoring categories and 
claimed to use similar criteria to score papers produced in their own classrooms. 

Data from the second round of scoring were then analyzed. Unfortunately, 
the findings from these ratings did not significantly advance our research goal. 
Percentage of exact agreement among raters nudged up to about 35 percent, but 
alpha coefficients for rater agreement dropped to around .45. Most disappointing 
were relatively high intercorrelations (in the .80 range) among rating categories. 
These strong relationships were confirmed by a factor analysis that produced only 
two factors, one factor consisting solely of the mechanics rating and the other 
loading all other categories. These disappointing results forced us to regroup 
intellectually once again. Fortunately, we were able to compare the results from the 
first set of ratings by the five history teachers with this set of scores, since the 
identical student papers were read by both groups of history specialists. The 
categories in our revised rating scale that most highly correlated with the overall 
content quality rating from the first experiment were Historical Context, Breadth of 
Elaboration, and Depth of Elaboration; these categories were set aside for future 
exploration. 
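
The two agreement statistics reported above can be made concrete with a small sketch. The report does not say how either figure was computed, so what follows is only one plausible reading: exact agreement is taken as the share of rater pairs that gave a paper the identical score, and the alpha coefficient is assumed to be Cronbach's alpha with raters treated as items. The function names and the scores are hypothetical and serve only as illustration; they are not data from the study.

```python
from itertools import combinations
from statistics import pvariance


def exact_agreement(ratings):
    """Share of rater pairs, pooled over papers, that gave identical scores.

    `ratings` is a list of papers; each paper is a list of scores,
    one score per rater.
    """
    matches = 0
    pairs = 0
    for paper in ratings:
        for a, b in combinations(paper, 2):
            matches += (a == b)
            pairs += 1
    return matches / pairs


def cronbach_alpha(ratings):
    """Cronbach's alpha with raters treated as items (an assumption)."""
    k = len(ratings[0])                      # number of raters
    by_rater = list(zip(*ratings))           # one tuple of scores per rater
    rater_vars = sum(pvariance(r) for r in by_rater)
    total_var = pvariance([sum(paper) for paper in ratings])
    return (k / (k - 1)) * (1 - rater_vars / total_var)


# Hypothetical scores on a 1-5 scale: five papers, three raters each.
scores = [
    [1, 1, 2],
    [2, 2, 2],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]

print(f"exact agreement: {exact_agreement(scores):.2f}")
print(f"alpha: {cronbach_alpha(scores):.2f}")
```

Exact agreement counts only identical scores, whereas alpha depends on how consistently the raters order the papers, so the two statistics can move in opposite directions, as they did in this round of scoring.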

We so far had investigated the existence of common implicit criteria for 
content quality ratings, had analyzed the think-aloud protocols of raters, and had 
noted criteria used in identifying successful student papers. We then had created a 
comprehensive list of content-relevant elements, had reduced them to a smaller set 
of categories for feasibility purposes, and had trained a satisfied group of raters. Yet 
we seemingly had not made much progress toward our goal. At this point we 
realized that our entire process had been guided in large measure by what history 
specialists said they valued and usually focused upon when they graded papers. It 
became obvious that such descriptions might reasonably be influenced by the raters' 
desires to appear to be comprehensive and thoughtful, in other words, by the social 
desirability of their answers. 

Scoring Criteria: Pass Three 

A new strategy for developing scoring criteria was employed, using the model 
derived from expert-novice comparisons (see Chi & Glaser, 1980, for an illustration). 
Rather than focus on what experts said they did, we were going to study their actual 
performance on tasks identical to those provided the students. Three experts 
who were advanced graduate students in history, three secondary school 
history teachers, and three Advanced Placement students were asked to write 
answers to the same essay question used in the study above and to think aloud to 
permit us to assess their processes. The analyses of the essays produced by this 
process as well as our analyses of the think-aloud transcripts resulted in some clear 
direction for us in the area of criteria generation: Our analyses showed that all 
experts and some teachers used the elements in Table 4 to construct their essays. 



Table 4 

Elements Used by All Experts and Some Teachers in Essay Construction 



A strong problem or premise that directed a focused answer 

Use of prior knowledge, including principles as well as facts and 
events for elaboration 

Text references (i.e., Lincoln speech) 

Explicit effort to show interrelationships 



In contrast, very bright but relatively inexperienced students and some 
teachers leaned heavily on the text in two ways. First, they often simply 
paraphrased or even restated the text in their answer. Second, they tried to cover 
all elements discussed in the text and were unable to distinguish between more and 
less important details. As a result of this analysis, a scoring scheme was developed 
that included all of the elements in Table 4, augmented by an overall general 
impression score. We were ready for new data collection. 

Rethinking Our Task 

The first major redirection of this project occurred because of the paucity of 
quality textbooks and resulted in turning this assessment research toward the dual 
goals of measuring understanding and knowledge acquisition in the context of a 
particular subject matter corpus. The expert-novice analyses reshaped our focus in a 
second major way. If we accepted that prior knowledge in subject matter was 
essential to both premise-driven and elaboration components of quality of 
understanding, then it was clear that we should design our assessment situations to 
include explicit supports to enable students to access such information. We 
believed that we could do this in any number of ways and decided to explore a 
range of options, details of which we will present below. More importantly, we 
perceived that this decision dramatically revised our view of assessment. We 
decided that the assessment situation itself should help students to perform the best 
that they could. We had moved into the blurry territory between learning and 
testing. 

Revising the situation. Our next step was to create new questions to relate 
to the class of expert behaviors we had proposed as criteria. We decided to have all 
students read both the Lincoln and the Douglas texts to permit them to use 
comparison as a rhetorical structure. We developed two variations of essay 
questions, or prompts, which we experimentally crossed: One treatment condition 
included a narrative context for the prompt and asked the student to imagine being 
in the pre-Civil War period and focus on an imaginary cousin as the audience for the 
essay; the other prompt presented the task as a more typical school assignment with 
the teachers as the implicit audience. A second set of treatments varied the 
instructions given to the student to assist their access to relevant prior knowledge. 
Although both conditions explicitly directed students to use their previous 
understanding and knowledge about the historical period in answering the essay 
question, one condition asked a series of stepped, short-answer questions to be 
completed before the student began to write the essay (see Table 5). 



Table 5 

Sample Prompt: Narrative Version 



Topic: 

Imagine that it is 1858 and you are an educated citizen living in Illinois. 
Because you are interested in politics and always keep yourself well 
informed, you make a special trip to hear Abraham Lincoln and Stephen 
Douglas debating during their campaigns for the Senate seat representing 
Illinois. 

1) Unlike other tests, we hope you really will try to imagine yourself in the 
historical period of the debates, so take a couple of minutes to describe 
yourself, your family, and your work. (Spend about 2-3 minutes.) 

2) As a well-informed citizen, you are aware of the many important events, 
laws, and court decisions that relate to the debates. List as many of these as 
you can. (Spend about 3-4 minutes.) 

3) List, if you can, some principles that underlie our form of government 
and that are relevant to the debate. (Spend 3-4 minutes.) 

4) While listening to the debates, you begin to think about the major 
problems confronting the nation. Some of these problems relate to 
principles upon which our government was founded. List the major 
problems you can think of. (Spend about 3-4 minutes.) 

5) After the debates, you return home to find your cousin from England 
who has come to the U.S. for a visit. Your cousin asks you about some of the 
problems that are facing the nation at this time. Write the answer that you 
would give to your cousin, telling him/her about at least two problems that 
you feel are important. You can write this either like a regular essay or like a 
story. Just be sure to give your cousin the clearest picture you can. You may 
use any of the information you've identified above in your answer. 

Be sure to describe each of the problems clearly and tell your cousin about 
events, laws, court decisions, and major principles of U.S. government that 
are related to the problem. Also explain the different solutions that are 
proposed to the problem, and give an example of what might happen if 
these solutions were adopted. 

As a conclusion to your paper, write a brief summary that integrates the two 
problems and states your own position on the whole topic. 



We also developed a prior knowledge test, basing it on the broad model 
developed by Langer (1984), for two purposes. First, we wanted to help students 
access relevant prior knowledge; second, we wanted to look at the relationship 
between that measure and rated use of prior knowledge in the essay. This 20-item 
test was created using a set of specifications to control the nature of the content 
queried. Students were to write brief descriptions or definitions for each of the 
terms provided, some of which were facts and events (e.g., Dred Scott decision), and 
some of which were at the principle (or at least concept) level (e.g., sectionalism). 
A few terms were irrelevant to the passage, and some were only tangentially 
relevant. 

The new test administration sequence required two days. On the first day 
the students were to complete a personal information form (including details about 
their interests, age, etc.) and the 20-item prior knowledge measure. They then were 
to read the Lincoln and Douglas text segments and complete a short (14-item) 
multiple-choice test on information in the speeches. On the second day, they were 
to receive the essay question, write for about 45 minutes, and complete a short 
debriefing questionnaire that asked for their reactions to the testing and for their 
estimates of their performance on the set of tasks tested. Following a pilot test in 
two Los Angeles classrooms, we tried the new assessment package in twelve 
classrooms in Springfield, Illinois. 

The Illinois Study 

The purpose of the Illinois study was to test the assessment procedures 
under large-scale assessment conditions and to obtain data to bear upon the validity 
of our findings. Here we have space for only a short description and discussion of 
this study. In brief, 250 students in 11th grade participated, equally assigned from 
AP, college preparation, and regular classes. Two full class periods were allowed for 
the assessment. Students were told they were participating in a UCLA study to 
develop new measures for history. Since there were four treatment variations 
(stepped essay prompts/short prompts/narrative context/school context), students 
received their packets assigned at random within each classroom. On-site observers 
from UCLA administered the materials and collected information from teachers 
about their views of students' relative strengths in history, reading, test taking, and 
writing, and information about each teacher's instructional efforts in the topic area. 
In addition, we collected data from transcripts that reported students' course 
experience, grade point averages, and standardized test scores in writing, social 
studies achievement, and reading comprehension. 
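
The crossed design described above, in which prompt context (narrative or school) is crossed with prior knowledge support (stepped or short) and packets are handed out at random within each classroom, can be sketched as follows. The report does not describe the mechanics of the randomization, so the shuffle-and-rotate scheme, the function name, and the roster are assumptions made for illustration.

```python
import random
from itertools import product

CONTEXTS = ("narrative", "school")    # audience framing of the prompt
SUPPORTS = ("stepped", "short")       # with or without stepped questions
CONDITIONS = list(product(CONTEXTS, SUPPORTS))   # the four crossed packets


def assign_packets(students, rng):
    """Randomly assign one classroom's students to the four packets.

    Students are shuffled and packets are dealt in rotation, so cell
    sizes within a classroom differ by at most one. This balancing is
    an assumption, not a detail taken from the study.
    """
    shuffled = list(students)
    rng.shuffle(shuffled)
    return {
        student: CONDITIONS[i % len(CONDITIONS)]
        for i, student in enumerate(shuffled)
    }


rng = random.Random(1989)             # fixed seed so the sketch is repeatable
classroom = [f"student_{n:02d}" for n in range(1, 22)]   # hypothetical roster
assignment = assign_packets(classroom, rng)
for student, (context, support) in sorted(assignment.items()):
    print(student, context, support)
```

Randomizing within each classroom, rather than giving whole classrooms a single packet, keeps teacher and class-level differences from being confounded with the treatment conditions.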

To obtain results, prior knowledge scoring rubrics were developed and 
applied to student responses. Scores ranged from 3, a fully elaborated answer, to 1, 
an incorrect or incoherent response. Two graduate students were trained to use the 
prior knowledge rubric and achieved .96 interrater reliability across the total 
measure (individual item agreements for the 20-item measure ranged between .70 
and .96; 15 items had at least .86 agreement and only 2 fell below .80). Essays were 
rated using the new scoring rubric presented in Table 6. 



Table 6 

Elements of Cognitively Sensitive Assessment Scoring Rubric 



Problem Focus 

Prior Knowledge: Principles and Facts 
Text Reference 
General Impression 



This time our empirical results were encouraging. Interrater reliabilities for 
the essay subscales ranged between .85 and .98. Intercorrelations among subscales 
were found between .0 and .60, supporting the premise that different aspects of 
student content quality were being assessed. Our findings also shed some light on 
the validity of the rubric. First, we determined that the measures reflected the 
different ability levels of the sample, with AP students scoring twice as high as the 
slower students on prior knowledge measures and on overall essay scores, and more 
than three times higher on use of principles in the essay. Our findings also showed 
strong relationships between teacher judgment of overall student achievement in 
history and our data (r = .42 for essay, .63 for prior knowledge). Our measures and 
standardized tests correlated .73 and .43, a variation based upon standardized test 
content. 
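
Low to moderate intercorrelations are what one hopes to see when subscales are meant to capture different aspects of an essay. A minimal sketch of how such a table of subscale intercorrelations might be computed is given here. Pearson's r is assumed because the report does not name the coefficient, and the subscale names, helper functions, and scores are hypothetical rather than data from the Illinois study.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intercorrelations(subscales):
    """Correlation for every pair of subscales, keyed by the pair of names."""
    return {
        (a, b): pearson_r(subscales[a], subscales[b])
        for a, b in combinations(subscales, 2)
    }


# Hypothetical subscale scores for six essays.
subscales = {
    "problem_focus": [1, 2, 2, 3, 4, 5],
    "prior_knowledge": [2, 1, 3, 3, 5, 4],
    "text_reference": [3, 4, 2, 5, 1, 3],
}

for (a, b), r in intercorrelations(subscales).items():
    print(f"{a} x {b}: r = {r:+.2f}")
```

Correlations near the .80 range, as in the second round of scoring, would suggest that raters were giving essentially one judgment under several labels; values spread well below that are more consistent with distinct dimensions.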

Scoring Criteria: Pass Four 

We reviewed our findings and decided it was time to test whether regular 
history teachers could be trained to use the cognitive scoring scheme. We also 
decided to revise the scale in a number of ways: to add categories for 
misconceptions and interrelationships, since in our own discussions we had not 
found a place in our system to take such concerns into account; and to refine the 
scale points for principle and problem focus. The categories in the scoring rubric are 
displayed in Table 7. 



Table 7 

Cognitive Assessment Scoring Rubric (1989) 



Presence of Problem Focus 
Prior Knowledge: Principles 
Prior Knowledge: Facts and Events 
Text 

Interrelationships 
Misconceptions 
General Impression 



We then conducted a training session with four high school history teachers 
to test the feasibility of our modified scoring approach. The training took 
approximately four hours, followed by the scoring session. Once again, we were 
very encouraged by our results. The prior knowledge measures and the essays were 
found to be reliably scored by teachers. Slightly lower interrater agreement overall 
was found for the high school teachers compared to the level obtained by project 
research assistants (alpha = .93 instead of .96). The interrater reliabilities on the 
essay subscales for teachers were in the .80-.90 range, except for the newly added 
misconception category (.68). Correlations between the prior knowledge measure 
and related elements of the scoring scheme were all reasonably high, averaging 
around .59, except for misconceptions (-.20) and text material (-.28). We conducted 
a factor analysis on essay subscales, and two major factors emerged. One factor 
included overall scores on content quality, the use of principles-based prior 
knowledge, premise-focused writing, and interrelationships. The second factor 
included misconceptions, the use of facts, and the use of text-based material. 
Although we are not completely convinced that this factor structure is sensible, the 
configuration of elements as it relates to the cognitive construction of meaning 
(factor one) and of the application of disconnected, and perhaps incorrect, 
information (factor two) is provocative. 

Next Steps 

Research subsequent to the Illinois study has been undertaken to verify the 
utility of the scoring system across topics, age ranges, and test administration 
conditions. We are looking at the performance of 9th-, 10th-, and 11th-grade 
students in two school districts. Data have been collected and are presently under 
analysis using two additional assessment topics. Both of these topics are drawn from 
the pre-Revolutionary War period and include texts by Paine, Henry, and Inglis. In 
addition, new materials have been developed for an extended assignment that 
involves Long and Roosevelt texts from the Depression period and incorporates as 
well additional resource materials for students' optional use. We anticipate a total of 
five hours will be needed for the assessment. 

Limitations and Cautions 



We have recounted the details of this effort to provide some insight into 
how assessment systems might be developed to reflect better the ways students 
actually learn and integrate subject matter material into their repertoires. We 
detailed our troubles and dead ends to demonstrate that the process of developing 
new kinds of useful and valid achievement measures is difficult and time consuming. 
New approaches to assessment are essential, but their development must be 
grounded in a theoretical view of learning. Establishing the validity of such new 
measures is also a difficult proposition. At least three major problems exist. One 
difficulty is the circular nature of new test development. Measures need to relate to 
but not be too strongly predicted by existing measurement strategies. A second 
problem with "deep understanding" tasks is the clear lack of systematic experience 
for the average student. Most students reported to us that our tasks were unusual 
for them. Their overall performance levels were exceptionally poor. To determine 
if our measures are truly valid (that is, if they reflect the desired class of learning), 
experimental studies must be constructed where students are trained explicitly in 
the process of integrating specifically presented material with various types of prior 
knowledge. Third, and most difficult, an optimal level of generality for task 
descriptions and scoring criteria is needed. This level must be sufficiently detailed to 
control raters' scoring behavior and to be valid for specific tasks. It must be 
sufficiently general to provide cues for teachers to use in planning and 
implementing instruction. A rough approximation of how such information can be 
economically displayed is provided in the specifications presented in Table 8. Such 
specifications would be augmented by detailed scoring rubrics with scale point 
definitions and also by a set of student papers illustrating, on different topics, 
various levels of proficiency. Clearly, a new program of psychometric research is 
needed. In the interim, we suggest that validity studies include criterion analyses 
by experts, experimental training studies, multiple measures of student learning 
processes, and demonstrations of statistical and conceptual connections to other 
reasonable estimates of performance, even including standardized tests. 

We know that tests have driven instruction in the past. Can tests of the sort 
we are developing do so in a productive rather than a destructive way? What 
evidence do we have that teachers of history focus on the integration of new 
knowledge with prior information, that is, on the view that learners construct 
meaning? Are such tasks within the capability of all students? When we are 
constantly bombarded with stories that students don't know where the Pacific Ocean 
is or the half-century in which World War I occurred, is it naive to think that they 
can accumulate knowledge and use it to make inferences and explanations? These questions must 
be pursued. We believe that there are specific next steps to be accomplished. A 
major challenge is the development of a new theory of test design and validation, 
one that emphasizes individual learning rather than individual differences. Test 
designers must recognize that the measurement of significant processes takes 
significant time, and consequently tests of many short items and broad content 
sampling may need to be supplanted or supplemented by fewer, more complex 
assessment situations. We need to develop concepts that will allow teachers to 
understand how to use such measures as an integral part of their instruction. Finally, 
we must get ready for the serious task of educating policy makers and the public 
about new models of assessment. We must counsel patience and anticipate that 
results are going to look worse, especially with new challenging measurement 
approaches, before they look better. When improvements eventually occur on 
cognitive measures such as those we have explored, we want them to reflect real 
and trustworthy learning for all students. 



Table 8 

Specifications for Writing Tasks 



Discourse Type 
    Informative writing 

Subgenre 
    Explain/infer 

Major Cognitive Process 
    To demonstrate the acquisition of new knowledge or concept by 
    contextualizing and elaborating position using prior knowledge 
    (principles and facts) 

Writing Process Measured 
    Drafting 

Audience 
    Imaginary, peer 

Topic Range 
    Subject matter based 
    History: A summary of major position by opposing statesmen 

Information Given in Prompt 
    History: Text of speeches or essays written by historical figures 
    (e.g., Lincoln) 

    Format: 
        Brief text 
        Prior knowledge cues: Consisting of appropriate and 
        inappropriate terms for specific processes, facts, or principles 

    Amount: 
        2 or 3 pages (no more than 10 minutes of reading) 
        A list of 10 to 20 entries for prior knowledge 

Criteria 
    Content: 
        Organizing premise 
        Explicit use of prior knowledge, principles and facts (either 
        provided or student generated) to explain or elaborate 
        Avoidance of misconceptions 

    Structure: 
        Relevant text references 
        Show interrelationships using text and prior information 

Administrative Conditions 
    Time: 45-60 minutes 
    Resources: Students may refer to text and prior knowledge list 
    during essay preparation 
    Interaction: None 

Sample Prompt 
    Segment of Patrick Henry's speech, plus list of prior knowledge 
    measure 

    Read the speech taken from the period just before the American 
    Revolution. You are supposed to explain to a cousin visiting from 
    Canada what Patrick Henry meant and what led him to the 
    position he is in. Use help from the list of information to provide 
    a clear answer. 

Parallel Prompt 
    Same except pre-Civil War, Stephen Douglas 